According to the model selection, the best model structure is y ~ x1 + x2 + x4 + error, where x1, x2, and x4 denote the first, second, and fourth powers of x. Now we’re going to test the model further.

pacman::p_load(ggplot2, ggthemes, tidyr, patchwork, plotly, ggforce)

Fit the model

data = read.csv("data/x_y.csv", header = F)
colnames(data) = c("x", "y")

Construct X matrix

X = cbind(data$x,
          data$x^2,
          data$x^4
          ) 

Split into train-test

train_X = X[1:200,]
train_y = as.matrix(data$y[1:200])

test_X = X[201:250,]
test_y = as.matrix(data$y[201:250])

Fit model, generate predictions, compute error (SSE, MSE) and adjusted R^2.

theta = solve(crossprod(train_X), crossprod(train_X, train_y))

y_pred = train_X %*% theta

residuals = train_y - y_pred
SSE = sum(residuals^2)
# residual variance estimate; note the usual unbiased estimator divides
# by n - p (here 200 - 3) rather than n - 1
sigma_sq = SSE/(nrow(train_X) - 1)
## MSE is: 0.01069316 
## Adjusted R^2 is: 0.9999891
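The chunk that produced the two numbers above is not shown; a minimal sketch of how MSE and adjusted R^2 could be computed from the quantities already defined (the exact formulas in the original chunk may differ in detail):

```r
n = nrow(train_X)   # 200 training observations
p = ncol(train_X)   # 3 fitted parameters

MSE = SSE / n

# adjusted R^2, standard formula
SST    = sum((train_y - mean(train_y))^2)
R2     = 1 - SSE / SST
adj_R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)
```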

Residual analysis

Numerical analysis

## Residual analysis:
##          Min          25%       Median          75%          Max 
## -0.305319926 -0.061646247  0.008535535  0.077296305  0.291481680
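The summary above is a five-number summary of the training residuals; a sketch of how it could be produced:

```r
# quantiles of the training residuals
res_summary = quantile(residuals, probs = c(0, 0.25, 0.5, 0.75, 1))
names(res_summary) = c("Min", "25%", "Median", "75%", "Max")
res_summary
```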

Parameter covariance matrix

##               [,1]          [,2]          [,3]
## [1,]  5.119680e-05  9.940021e-06 -1.779249e-06
## [2,]  9.940021e-06  4.546752e-05 -5.317726e-06
## [3,] -1.779249e-06 -5.317726e-06  8.751376e-07
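The matrix above is the standard OLS parameter covariance, cov(theta) = sigma^2 (X'X)^(-1); a sketch of the computation:

```r
# parameter covariance matrix: sigma^2 * (X'X)^-1
cov_theta = sigma_sq * solve(crossprod(train_X))
cov_theta
```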

Parameter uncertainty pdf

Because we have 3 parameters, we’ll need to plot all their pairwise combinations, resulting in 3 plots.

First, we create a grid of possible parameter values for which we estimate the uncertainty.

n_points = 50
range = 0.01   # half-width of the grid around each estimate (shadows base::range)
x1_grid = seq(theta[1] - range, theta[1] + range, length.out = n_points)
x2_grid = seq(theta[2] - range, theta[2] + range, length.out = n_points)
x4_grid = seq(theta[3] - range, theta[3] + range, length.out = n_points)

# example parameter vector (the last grid point of each parameter) at which
# to evaluate the pdf; note that `t` shadows base::t outside function calls
t = rbind(x1_grid[50], x2_grid[50], x4_grid[50])

Now we can calculate the pdf

# inverse of the parameter covariance matrix: cov(theta) = sigma_sq * (X'X)^(-1)
cov_inv = crossprod(train_X) / sigma_sq
cov_det = 1 / det(cov_inv)   # determinant of the covariance matrix itself

n_params = 3
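The density chunk itself is not shown; a sketch of evaluating the multivariate normal pdf of the parameter estimates at the point `t` defined above, recomputing the covariance quantities locally (the function name and its arguments are illustrative; the loop over the full grid is analogous):

```r
# covariance of theta_hat, with its inverse and determinant
cov_theta  = sigma_sq * solve(crossprod(train_X))
prec_theta = solve(cov_theta)   # precision (inverse covariance)
det_theta  = det(cov_theta)

# density of a k-dimensional normal N(mean, cov) at point x
mvn_pdf = function(x, mean, prec, det_cov, k) {
  d = x - mean
  as.numeric(exp(-0.5 * t(d) %*% prec %*% d) / sqrt((2 * pi)^k * det_cov))
}

mvn_pdf(t, theta, prec_theta, det_theta, n_params)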

Parameters x1 and x2

Parameters x1 and x4

Parameters x2 and x4
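The plotting chunks are not shown; a sketch for one pair (x1 and x2), evaluating the joint density over the grid with the remaining parameter fixed at its estimate. Using `mvtnorm::dmvnorm` here is an assumption (the original may compute the density manually), and the data-frame name is illustrative:

```r
pacman::p_load(mvtnorm)   # for dmvnorm; an added dependency

# covariance of the parameter estimates
cov_theta = sigma_sq * solve(crossprod(train_X))

# joint density over the (x1, x2) grid, with x4 fixed at its estimate
pdf_grid = expand.grid(x1 = x1_grid, x2 = x2_grid)
pts = cbind(pdf_grid$x1, pdf_grid$x2, theta[3])
pdf_grid$density = dmvnorm(pts, mean = as.vector(theta), sigma = cov_theta)

ggplot(pdf_grid, aes(x1, x2, z = density)) +
  geom_contour() +
  theme_few()
```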

Model’s prediction

Now we’ll generate predictions on the training data along with 95% confidence intervals.
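The interval chunk is elided; a sketch of pointwise 95% confidence intervals for the mean prediction, using the normal approximation y_hat ± 1.96·sqrt(x' cov(theta) x). The data-frame name `pred_data` matches the plotting code below:

```r
# covariance of the parameter estimates
cov_theta = sigma_sq * solve(crossprod(train_X))

# variance of each fitted value: diag(X %*% cov_theta %*% t(X))
pred_var = rowSums((train_X %*% cov_theta) * train_X)

pred_data = data.frame(
  y      = train_y,
  y_pred = y_pred,
  lower  = y_pred - 1.96 * sqrt(pred_var),
  upper  = y_pred + 1.96 * sqrt(pred_var)
)
```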

Plot the predictions + CI

(y_y_pred = ggplot(pred_data, aes(x = y, y = y_pred))+
  geom_point()+
  theme_few())

ggsave("figures/08c_true_pred.png", y_y_pred, width = 7, height = 4)

Model validation

Validate the model on a new train-test split (70:30).

Construct X matrices (train and test)

Re-use the fitting function from model selection

Fit model on training data and predict testing data
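The split/fit chunk is elided; a minimal sketch assuming a random 70:30 split. The fitting helper from the model-selection step is not shown, so the normal equations are used directly here, and the seed is an assumption:

```r
set.seed(1)   # assumption: no seed is given in the original
n = nrow(X)
train_idx = sample(n, size = round(0.7 * n))

Xtr = X[train_idx, ];  ytr = as.matrix(data$y[train_idx])
Xte = X[-train_idx, ]; yte = as.matrix(data$y[-train_idx])

theta_v = solve(crossprod(Xtr), crossprod(Xtr, ytr))

mse = function(y, yhat) mean((y - yhat)^2)
cat("MSE on training data = ", mse(ytr, Xtr %*% theta_v), "\n")
cat("MSE on testing  data = ", mse(yte, Xte %*% theta_v), "\n")
```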

## MSE on training data =  0.01094003 
## MSE on testing  data =  0.008348671

Validate model with k-fold cross-validation
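The cross-validation chunk is not shown; a sketch of k-fold CV over the full data set, collecting per-fold train/test MSE into the `result` data frame plotted below. The fold count k = 10 and the seed are assumptions:

```r
k = 10   # assumption: fold count is not stated in the original
set.seed(1)
folds = sample(rep(1:k, length.out = nrow(X)))

result = do.call(rbind, lapply(1:k, function(i) {
  Xtr = X[folds != i, ]; ytr = as.matrix(data$y[folds != i])
  Xte = X[folds == i, ]; yte = as.matrix(data$y[folds == i])
  th  = solve(crossprod(Xtr), crossprod(Xtr, ytr))
  data.frame(fold      = i,
             MSE_train = mean((ytr - Xtr %*% th)^2),
             MSE_test  = mean((yte - Xte %*% th)^2))
}))
```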

ggplot(result, aes(MSE_train, MSE_test))+
  geom_line()